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ABSTRACT 


The FAIR data guiding principles have been recently developed and widely adopted to improve the 
Findability, Accessibility, Interoperability, and Reuse of digital assets in the face of an exponential increase 
of data volume and complexity. The FAIR data principles have been formulated on a general level and the 
technological implementation of these principles remains up to the industries and organizations working on 
maximizing the value of their data. Here, we describe the data management and curation methodologies 
and best practices developed for FAIRification of clinical exploratory biomarker data collected from over 
250 clinical studies. We discuss the data curation effort involved, the resulting output, and the business and 
scientific impact of our work. Finally, we propose prospective planning for FAIR data to optimize data 
management efforts and maximize data value. 


1. INTRODUCTION 


As the pharmaceutical industry evolves to combine the latest advances of medical knowledge and 
technology, the amount and complexity of healthcare data is increasing exponentially with an estimated 
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compound annual growth rate of 36% [1]. Healthcare data volume has reached the zettabyte scale by some 
projections [2] with approximately 80% of the data remaining unstructured and poorly organized [3], thus 
of limited utility for downstream analysis without extensive data curation efforts. 


To address the challenges of data management and stewardship across all data-intensive industries, 
the FAIR (Findable, Accessible, Interoperable, and Reusable) data guiding principles have been recently 
developed by a diverse set of academic, corporate and governmental stakeholders [4, 5]. These principles 
have been quickly adopted by publishers, funding agencies and international intergovernmental 
organizations [6, 7, 8]. Successful implementation of the FAIR principles by the pharmaceutical industry is 
a key prerequisite for digital transformation [9, 10] and the disruptive potential of machine learning and 
artificial intelligence in drug discovery [11]. Additionally, it is highly anticipated that data FAIRification will 
accelerate innovation, aid in the development of personalized medicine, drive down drug development 
timelines, reduce R&D costs and enable data sharing between research groups within and across institutions 
and companies [12]. 


While still in the early stages, industry-wide implementation of the FAIR data guiding principles is 
faced with many challenges, including the costs of FAIRification, limited understanding and availability of 
technology and standards to support FAIR implementation, and the need for cultural change within the 
organizations. Addressing some of those challenges, a bottom-up approach for data integration driven 
by use cases would allow the organization to begin building the basic infrastructure and standards to 
make data FAIR, increase organizational awareness, demonstrate business impact and build capabilities 
for the future. 


Among the different types of healthcare data, biomarker data have played an increasingly significant role 
in drug development in recent decades [13, 14, 15, 16]. Biomarkers are used to understand mechanisms 
of action, evaluate pharmacology (pharmacokinetics/ pharmacodynamics), explain differences in treatment 
responses, and select or stratify subjects. A study of clinical development success rates from 2006 to 2015 
demonstrated benefits of using biomarkers for subject selection alone—a threefold increase in the probability 
of approval from Phase | and 20% increase in the transition to approval from Phase III [16]. While essential 
for utilizing its full value in drug development, FAIRification of biomarker data has its own set of challenges. 
First, there is a broad and continuously evolving diversity of assays—from single immunoassays to complex 
multicolor flow cytometry. Second, assay workflows are diverse as each assay type has unique workflow 
and quality control parameters. Third, clinical sample flow is complex as different specimen types are taken 
at multiple clinical sites, and then the specimens (and subsequently derived samples) are processed and 
sent from the first vendor to downstream vendor #2 and so on, often resulting in a convoluted sample flow 
chart. Fourth, some therapeutic areas, like oncology, are especially complex. Disease pathogenesis and 
heterogeneity, drug response and resistance, genetics and the impact of the immune system, among other 
factors, each needs to be researched and carefully considered to effectively address the complexity of the 
disease, creating a different type of data complexity in the process. 


Traditional workflows have been inadequate for handling the data volume and the complexity of 
biomarker data harmonization and FAIRification, often resulting in significant delays, or even reduced data 
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availability for downstream analysis and lost value altogether. Creating new and more efficient workflows 
for biomarker data FAIRification and integration is therefore crucial to enable the data to be used for future 
exploration of scientific questions and data analysis, meeting regulatory requirements and supporting go/ 
no go decisions at each stage of clinical trial. 


This paper focuses on the methodology and best practices for data curation and metadata mapping 
developed for FAIRification of clinical exploratory biomarker data from several use cases. The biomarker 
data were collected from over 250 legacy studies (defined by the FDA as a set of data that is not developed 
based on current FDA-endorsed data standards), such as the CDISC SDTM data model [17, 18] and on-going 
clinical trials and contained a wide variety of data types including next-generation sequencing (NGS), 
immunohistochemistry, flow cytometry, imaging and mass spectrometry, among other assay types. 


Once the necessary models and processes are in place, a data repository can be developed to share and 
explore the data to enable scientists to submit queries, such as in the following scenarios: 


- For hypothesis exploration, identify all data sets that include “my favorite gene” and report assay and 
study metadata to allow selection of data set(s); 

- For expression quantitative trait loci analysis (eQTL) in a particular disease, identify paired whole 
exome sequencing (WES) and RNA-Seq samples originating from the same specimens; 

- For meta-analysis of treatment effect, identify pre- and post-treatment RNA-Seq fastq files that meet 
provided quality control (QC) criteria and group by laboratory batch. 


While the details of the final implementation may vary depending on user needs, the process described 
in the following sections should remain the same. This paper will share insights and best data management 
practices from our experience with a large and varied biomarker collection and conclude with a discussion 
of future plans. 


2. CONCEPTUAL DATA MODEL FOR BIOMARKER METADATA MAPPING 


Harmonized and linked metadata is a powerful tool for data stewardship as it enables and supports all 
four key principles of FAIR data [4]. One of the crucial steps in the data FAIRification workflow is to 
establish a metadata model that allows efficient data integration and complex queries in the final data 
repository. This is a key element of an open-data ecosystem in which data are valued, preserved and reused. 
As a full data model is out of scope for this work and has been proposed by others [19, 20, 21], we will 
focus on the aspects minimally required as a framework for data FAIRification. The model presented here 
is focused on sufficient metadata to uniquely identify subjects, samples and assays and was developed de 
novo after several iterations. With these in place, links can then be created and maintained between subject, 
sample, assay and data, thus creating a framework that can be supplemented with additional metadata or 
linked to other source systems. 
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2.1 Study Subject 


In the clinical biomarker data environment, a clinical repository is usually the source system for subject, 
treatment, and disease characteristics. As the goal of a minimal metadata model is to maintain sufficient 
metadata to link to source systems where available, in the case of clinical trial data, it is not necessary to 
maintain subject-level metadata (e.g., age, sex, disease) with the biomarker results when a small number 
of identifiers are sufficient to link to the rich source of information in the clinical trial record. Therefore, 
the focus of the FAIRication effort for subjects is to map the study identifiers and subject identifiers assigned 
to the biomarker data to the identifiers in the clinical trial records. This step ensures the validity of data 
through downstream anonymization processes that are often required for re-use or sharing of the data. 


While this may appear to be simple, in reality, these identifiers can take different forms over the life of 
the trial including the use of study names or aliases in place of the unique study identifier, and the re-use 
of subject identifiers from one study to the next. 


2.2 Specimen and Derived Samples 


Tracing sample identifiers reported with biomarker data to the original specimen and subject is one of 
the most time-consuming aspects of data integration. The goal of FAIRification is to capture the sample 
metadata required to uniquely define a sample and to link it to the subject from which it was collected. 
In our case, there was no sample identifier source system that could provide the lineage for all sample 
identifiers, resulting in the need to build that lineage as part of the data curation and mapping process. The 
degree to which the clinical data repository contains biomarker sample metadata varies greatly across the 
industry as well as from study to study within one organization. While the identifier initially assigned to 
the specimen when it is taken from the subject is sometimes recorded in the clinical record, this is not a 
universal practice. Furthermore, as the sample is processed, new identifiers are generated. Sample derivation 
lineage is an important aspect of sample metadata, supporting the need to distinguish technical data 
duplication from data produced from unique samples collected simultaneously. The lineage sample 
identifiers trace a sample minimally to the immediate parent sample. If sample processing practices allow, 
it is optimal to also retain the original specimen identifier, the date of sample collection, the trial visit 
identifier, and the anatomical location from which the specimen was collected. These allow for a robust 
integration with the clinical record. Related to the downstream-derived samples, we record the type of the 
sample and organization that produced it. Lineage sample identifiers and the minimal metadata are shown 
in Figure 1. 
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Figure 1. FAIRification of exploratory biomarker data by linking metadata for subjects, samples, assays and data. 
This minimal logical model of metadata and relationships is a visual representation of the curation outputs. 


2.3 Assays 


Biomarker assays are routinely updated or replaced by newer technology. The minimal metadata for 
assays must capture the differences between related but non-comparable assays. Due to the complex 
variables that define an assay, it is optimal to store assay data in a repository that can serve as a source 
system for linking the data files to the assay definitions. When this work began, an assay repository was 
available but most biomarker assays were not recorded. The curation effort included obtaining assay details 
to enter into the assay repository. While the full model of the assay repository is out of scope for this work, 
we will describe the process of capturing the required metadata to assign the unique assay identifier 
required for the FAIRification process. 


2.4 Relationships 


Once subjects and samples are mapped and assays are defined, the final step is to create the relationships. 
The data files are the products of an assay performed on a sample. The sample identifier and the assay 
identifiers are thus linked to the data file identifiers, as shown conceptually in Figure 1B. In operation, the 
curated tables are generated with file names and file paths, which are used to determine the appropriate 
unique identifier for the file in the biomarker repository. Curation processes for each of these tables are 
discussed in greater detail below. 
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3. LINKING STUDY SUBJECTS AND SAMPLES 


The FAIRification process entails establishing connections between the study subjects and samples taken 
from them as well as relevant metadata for each sample (e.g., sample type, anatomical location, processing 
vendor, collection date-time). Clinical trial biological samples may be obtained when screening subjects 
pre-enrollment as well as during and post treatment. In general, clinical sites send samples to a central 
laboratory for storage and processing. Central labs then ship samples to ‘downstream’ vendors. These 
vendors execute the assays and may further process the samples. Vendors may also be asked to send samples 
to another laboratory downstream. Therefore, a key aspect of the process is tracing the lineage of sample 
derivation beginning with its collection at the study site. However, the lack of standards for maintaining 
linkages between sample, parent sample, subject identifier and sample metadata renders this task labor- 
intensive and resistant to automation. In this section, we describe our process for tracing and maintaining 
the linkage of study subjects and samples. 


In this context, we find it convenient to discuss differences between high-dimensional and low- 
dimensional data harmonization cases. High-dimensional data are categorized here as data generated by 
standardized assays involving high-throughput methodologies, most commonly omics data such as various 
uses of next-generation sequencing. High-dimensional biomarker data sets typically consist of an extremely 
large number of potential markers measured in relatively few samples of subjects. As these methods are 
generally newer and designed for automated downstream analysis, the data organization and formatting is 
usually coherent and consistent. Low-dimensional data are categorized here as the cases where results are 
collected into a data summary file. These aggregate results often incorporate results for multiple assays 
together and are common for low-throughput methods, such as immunoassays (e.g., ELISA, PCR, and some 
histology assays). These summary data files present their own set of challenges for FAIRification. 


3.1 Input Data 


Collaboration with the biomarker operations component of the study teams is critical to access the 
required input data for the linking work. A vital input in the Subject-Sample mapping process is the 
biosample management plan. This is usually presented as a graphic flowchart in a study planning document 
and conveys the samples’ processing history. Subject-Sample data are usually shown as tabular manifest 
files containing a combination of subject identification, sample identification and reference accession 
identification and at times additional sample metadata (e.g., type or date). For high-dimensional data, these 
files are usually processed from each run where each row contains data for a single sample. However, 
sample metadata including subject, visit, or sample type is not readily available from vendor-provided data, 
which hinders linkage efforts. Inputs for low-dimensional data are often assay results files. These are 
generally easier to map to subject and sample identifiers as much of the necessary information is present 
within the file itself. However, there can be more variation in file format than in high-dimensional data, 
making it more difficult to programmatically merge and analyze data even within a single study. Thus, the 
main challenges lie in integrating assay metadata and processing files into a standard format that can be 
easily merged and analyzed. 
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3.2 Challenges 
3.2.1 Legacy Data 


The lack of business continuity is a major challenge in curating legacy studies. No central repository nor 
single owner exists, which hinders access to required information. Closed studies may have out-of-date 
legacy documentation, or documents may be lost altogether when laboratories close, people move and 
antiquated data management systems are phased out. Hence, metadata for legacy studies may not be readily 
available, requiring extensive inventory of available resources. 


3.2.2 Ownership Changes Throughout Clinical Data Lifecycle 


For both high- and low-dimensional data, documentation of the complete lifecycle of the samples is 
crucial to track a sample identifier from an assay back to subject-level information. With high-dimensional 
data, laboratories that perform assays and generate the data often have limited sample information, and 
necessary sample metadata (e.g., subject or visit information) which is commonly missing in the tables used 
to reconstruct a sample's lifecycle. 


3.2.3 Lack of Standardization 


For high-dimensional data, sample metadata is often found in manifest files, which are non-standard and 
therefore difficult to integrate. Manifests are generally manually edited, for instance when samples are 
combined to extract enough DNA. As a result, they contain errors, especially since files are frequently 
provided as Excel spreadsheets which can give rise to formatting errors. In addition, it is not uncommon 
for a variable to contain a mix of values from two different sources (e.g., the screening and enrollment 
identifiers). We have even encountered cases where assay and reference accession identifications are hard- 
coded in raw data file names instead of being present in a manifest. 


Low-dimensional results files come in a large variety of types, including delimited text, Excel, and SAS 
formats. Automated curation scripts must be able to determine the file type to appropriately parse the input 
file. Regardless of file type, data organization can vary greatly from assay to assay or from vendor to vendor. 
A parsed data format is a common example of such inconsistency. While some files follow a tall-skinny 
format, wherein data from a single subject, test, or visit are found in their own row, other files may adhere 
to a wide format, where results from different tests or visits are split across variables. Thus, automated scripts 
must be able to standardize both tall-skinny and wide data into a single standard format for further 
processing. Inconsistencies with variable name and formatting can also occur in both low- and high- 
dimensional data. In some files, variable names are specified on the first row of the input file, and the file 
contains no irrelevant information that must be removed. In other cases, variable names span multiple rows 
or may not begin on the first row of the file. In addition, names for variables that represent the same 
information may differ across files, so automated scripts must be able to standardize variable names for 
important variables to be recognized and processed appropriately. Furthermore, the data themselves may 
be reported in different formats: date values may not be formatted to a common standard and text values 
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representing the same concept may be specified differently. Standardization of each type of data may 
require a separate process. 


3.2.4 Atypical Vendors 


Atypical vendors, such as academic laboratories, may provide primary- and meta-data requiring custom 
scripts or manual curation. For example, we have encountered tissue microarray slide data files described 
so informally in a spreadsheet that interpretation by the curator was required to determine the tissue’s 
source. 


3.2.5 Non-tabular Data 


Finally, some information about the samples is not contained in table form, but is found in reference 
documents, such as sample origin or visit codes, especially when those data are the same across all study 
samples. 


3.3 Process 


To link study subjects and samples, we build the chain of sample identifiers all the way back to the 
central lab that holds a manifest file containing all the metadata. For each sample, we collect all subject 
identifiers, sample identifiers, reference accession identifications and lineage information, sample type, 
anatomic location, date and time of sample collection, visit information and vendor. The Subject-Sample 
mapping process starts with surveying the study data and identifying available sources of metadata. First, 
we locate and manually parse the biosample management plan. We identify the type of identifier used and 
the documentation available for each assay and vendor. If the samples do not originate in a central lab, we 
locate the shipment file that maps the assay sample identifiers to the previous lab identifiers. Despite all 
efforts, some information may still be missing. Ongoing studies have a point of contact from whom we can 
request metadata files that are not readily available, or who can clarify the sample flow. In the case of 
legacy studies, missing data may be irredeemably lost and the curator often must decide how to handle 
each case. 


Many aspects of the curation process can be automated. However, the wide variety of input file types 
and formats renders it difficult for complete automation. Thus, we use a hybrid approach in which we 
intersperse prewritten functions or code chunks that perform common curation tasks with custom code 
required to process a specific data set into the format required by these functions. While parsing available 
documents (e.g., manifests and assay results files) we find that many contain similar types of information. 
However, variable naming conventions may differ significantly. Identifying relevant variables, renaming 
them to convention and removing unnecessary variables can be mostly automated, given a dictionary that 
specifies common variable names and their standardized counterparts. 


Once all available sample information is collected and checked by the curator, the table undergoes QC. 
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3.4 Quality Control 


To check the quality of the output table produced, we established a QC pipeline that includes both 
scripted and human aspects. The automated QC script consists of over 50 checks, including: (1) Ensuring 
fields conform to the appropriate controlled vocabulary, such as anatomical location and vendor names; 
(2) Ensuring date fields are in the correct format; (3) Ensuring all required data are provided; (4) Ensuring 
fields are properly ordered; (5) Ensuring samples have a unique origin; and (6) Ensuring attributes (e.g., date 
and visit) are consistent with those of the downstream samples. For manual curation, a second curator 
(usually more experienced) interprets the results of the QC script, traces the sources of any errors to allow 
for exceptions, validates the sample flow chart interpretation and performs a visual inspection of the curated 
output file and the curation script to identify any potential issues missed by automated QC. Any problems 
identified during the QC process are corrected by modifying the curation script, and a final version of the 
curated output file is generated. 


3.5 Output 


The curated output is written to a standard .csv file. This file contains standard variable names specified 
in a standard order (Figure 1). Data within a given field are standardized according to that field’s respective 
rules. Each row contains a sample identifier, the direct reference accession identification, the central lab 
reference accession identification (if it exists) and sample metadata. 


4. COLLECTING AND STORING ASSAY METADATA 


FAIRification requires a searchable central assay repository that stores sufficient metadata to distinguish 
between related but non-identical assays and assigns a unique identifier to each assay. Many considerations 
behind establishing such a database, like the underlying software component (storage engine) behind its 
management system, are beyond the scope of this paper. One critical component is the level of detail the 
database provides, which would eventually set the limitations on distinguishing assays. These considerations 
must take into account not only the type and diversity of assays that would be registered in the database 
but also the information expected to be available to the curators. The assay repository employed here holds 
a large number of distinct fields. These include (Supplementary Table 1) information regarding the purpose 
of the assay, the methodology, platform and vendor (e.g., ELISA, Simoa HD-1 Analyzer, Myriad RBM), 
analyte examined (e.g., type, gene symbol, UniProt ID), the acceptable tissue or specimen (e.g., plasma, 
tumor, normal) and technical details (e.g., quantitative limits and units). With this level of detail, curators 
can coherently and consistently define assays across studies. Harmonization of the terminologies (such as 
gene and vendor names) allows curators to determine whether a performed assay should be recorded as a 
new assay, or whether it is already recorded in the repository. More importantly, downstream users can 
generate informative queries across multiple studies and projects. Once established, the platform is manually 
populated by teams of curators that scan the available documentation associated with each data set 
collected, and register the assays used. The following subsections discuss available sources of assay 
information, the process of deriving it, and challenges and workarounds developed by curators. 
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4.1 Input/Challenges 


The process for assay registration entails collecting assay information from various sources (detailed 
below) and recording it in a standardized manner. A key issue in this process is determining what constitutes 
an assay, and which parameters and metadata are sufficient for distinguishing one assay from the other. The 
curation process is designed to create and record this metadata and address challenges such as incomplete 
information, multiple sources and formats, and lack of standard vocabularies as detailed below. 


4.2 Input Data 


Input data for collecting and organizing metadata for assays can be found in several types of documents. 
While some of these, such as validated assay protocols, contain all the required information, but others, 
such as sample transfer documentation, typically contain only partial information. Table 1 below details 
some of the information required to uniquely distinguish assays and where to find each piece of information. 
If these sources are unavailable or incomplete, curators can attempt to contact the scientist commissioning 
these assays internally, or the vendor directly. In many cases, the curation team was able to develop 
excellent collaborations with internal biomarker scientists to support the assay work. Despite this wealth 
of sources, curators still encounter several common challenges. 


Table 1. Information required to uniquely distinguish assays. 


Validated Study Sample Com- Inter- Published Vendor 
assay proto- transfer mercial nal paperson or 


protocols cols docu- vendors’ data- the scien- 
mentation websites bases studies tist 
Assay information Name Yes Yes 
Vendor Yes Yes Yes Yes 
Methodology General Method Yes Some Some Some Yes Yes 
Sub-Method Yes Some Some Some Yes Yes 
Detection Method 
Detection/measurement Yes Some Some 
device 
Biomarker/ Universal Identifier (e.g, Some Some Some 
Analyte UniProtID) 
Analyte/Panel name Yes Some Yes Yes Yes Yes Yes 
Analyte type (e.g., Protein) Yes Some Some Yes Some Yes Yes 
Measurement category Yes Some Some Some Some Yes 
Technical details Detection reagent (e.g., Yes Some Some Some Some Yes 
Antibody clone) 
Limits of quantification Yes Some Some Some Yes 
Measurement unit Yes Some Yes Some Yes Yes 
Tissue of interest Species Yes Yes Yes Yes Yes Yes 
Biological matrix Yes Yes Some Some Some Some Yes 
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4.3 Challenges 
4.3.1 Legacy Studies or Partial Information 


Missing data is common in studies conducted prior to the data explosion era as well as in newer studies 
which do not employ modern data management practices. In these studies, assay information is often poorly 
annotated. If enough time has elapsed and the methods are outdated, records may be lost and the personnel 
performing the assays (and sometimes the vendors) can no longer be reached. In such cases, a curator will 
attempt to derive the information from all available sources. However, despite all efforts, legacy studies 
commonly have only partial assay information, rendering it impossible to determine the exact assay used. 


Such endeavors cannot be automated and require manual curation by a scientist who is an expert in 
the field. 


4.3.2 Standardized Vocabulary 


Data analysis automation efforts rely on compact, non-duplicative, terminology standardized across all 
essential elements of assay information. These standardized terminologies, or code lists, minimize the 
variations and nuances found in natural language, thus supporting scripted data wrangling efforts. For 
example, the “Hematoxylin-Eosin” assay is referred to as “HE,” “H&E,” or other variation of its full name. 
Adhering to standardized names enables easy collection of data generated with this assay across all curated 
studies. While codification of common assays, as well as many analytes such as proteins or genes, is 
straightforward and relies on universal identifiers such as UniProt IDs or HUGO IDs, other assays or 
biomarkers are more challenging. For example, the terminology describing sorted cell subpopulations that 
express a particular set of surface marker combinations is constantly evolving and has not been affected 
by global efforts of harmonization. Consequently, code lists for assay details contain both universal and 
local codes. The assay metadata curation effort therefore requires constant maintenance of code lists that 
ensure the use of universal identifiers whenever possible and are shared between database maintainers 
and users. 


4.3.3 Distinguishing Closely Related Assays 


This issue stems from the broad and ever-growing variety of biomarker assays which is a direct result of 
both technological progress and proliferation of vendors endeavoring to fine-tune each assay to exact 
specifications. This heterogeneity spans multiple aspects, including differences in antibody clones for 
histological analysis, kits used to prepare RNA libraries for sequencing and platforms on which the libraries 
are run. This vast diversity means that instead of discrete, easily distinguishable assays, curators are often 
faced with a spectrum of closely related assays. We have developed a set of rules to group or distinguish 
between seemingly similar assays. For example: (1) Next-generation sequencing uses a specific library 
preparation method, and then this library is sequenced. Sequencing can run for a number of cycles, 
resulting in different read lengths. We consider assays that used different read lengths as the same method, 
as long as library preparation is the same; and (2) Immunoassays that use different monoclonal antibodies 
to the same antigen are different assays, even if all other parameters of the assay are identical. 
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4.4 Process 


There are multiple stages in the process of collecting metadata when registering an assay (creating a new 
record in the database). Curators first explore all available information sources, allowing them to next 
identify the analyte (in standardized terminology, such as UniProt ID). Older studies that use outdated 
protein or gene names or high-throughput assays that test multiple analytes may present complications. 
If assays have a large number of analytes, the analyte list is compared programmatically with similar, 
previously recorded assays to attempt a match. If no exact match is found, the curator must determine 
whether this is a new assay or a subset of the analytes that has been reported from an existing assay. For 
assays that do not have a list of analytes, such as whole genome sequencing which uses library preparation 
methods based on random priming, assays are determined by method and protocol only. 


Next, the curator identifies the vendor, after which he/she can inquire if an assay for the analyte already 
exists within the repository. If yes, metadata for the registered assay is compared to all available data 
protocols to confirm the assay is indeed the same. If the registered assay has a different vendor, a new “site 
assay” can be created with vendor-specific parameters (e.g., quantification limits). If a vendor cannot be 
identified, the curator must gather information about the assay and prepare it for registration. All required 
fields (e.g., general method, submethod, biological matrix, measurement units; see Supplementary Table 1) 
should be identified. If information is not readily available from existing documentation, curators then 
research literature or reach out to the vendor or scientist that performed the assay. Documentation must 
include how information was obtained, complete with reference papers or correspondences. Then, a 
concise and informative assay description may be created. 


After all information is gathered into the repository, a new record is created. The curator executes a QC 
check, and then sends the record to a second curator for an additional QC check. After all issues are 
resolved, the final iteration is submitted to the repository as a new assay. Finally, once the supervisor 
performs a cursory check and approves the assay, an assay identifier is assigned. 


4.5 Quality Control 


Due to format differences and source availability between studies, the assays are manually registered by 
specialized curators with the required domain knowledge. All decisions made are then QC-ed by a second 
specialized curator. The QC curator reviews all available sources submitted by the first curator, including 
communications with internal scientists or the vendor, and compares these to the information entered into 
the database. Programmatic QC steps are used to confirm analytes lists in assays with a large number of 
analytes. 


4.6 Output 


The outcome of the assay FAIRification process is a central repository containing all assays harmonized 
across studies, with a rich metadata layer. Downstream, these efforts ensure that programmatic data 
sectioning across studies considers differences in methodological approaches to determine appropriate 
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covariates, decreasing data noise and artifacts. However, before this occurs, the registered assays need to 
be associated with individual data files which are in turn associated with individual samples and subjects. 


5. LINKING SAMPLES, DATA FILES AND ASSAYS 


Data FAIRification entails transferring data from multiple studies and sources into a central location. 
Large organizations or groups may have legacy data stored across multiple cloud sources and individual 
computers, as well as incoming data from vendors. If data are transferred to a single storage site (such as 
a centralized file repository) it can be accessed organization-wide. It is then given information-rich metadata 
labels, allowing fast searches for data subsets. All files are cataloged and associated with the samples for 
which they were derived and with metadata regarding the assay used to generate them. Ideally, this 
association is concurrent with the data transfer from the vendor. However, metadata is currently assigned 
after the data have already been transferred and stored. In this section, we describe the process for creating 
both the sample-to-file and file-to-assay associations. 


5.1 Assigning Assay Metadata to Data Files 


Incoming data are initially organized by a receiving specialist or a curator group specializing in 
biomarkers. These curators receive the data files and conduct a superficial mining of the data being 
transferred to detect general information, such as the overarching technology used (e.g., histology, FACS, 
next generation sequencing) and place the data files in a hierarchical directory consistent across studies 
and assays. These are supplemented by vendor tags, resulting in a name and path for each data file that 
contains some assay information. The following chapter details the procedure we developed to map assay 
labels (assay identifications) and other assay metadata to data files based on existing data structure and 
available information. This process includes manual and scripted components, and multiple rounds of 
organization and identification to label a large number of data files accurately and efficiently. 


5.1.2 Inputs/Challenges 


The primary inputs for file-to-assay association are the data file names and paths tagged by vendors and 
receiving scientists to include rudimentary assay information. These are treated as strings to be parsed to 
pull metadata. As secondary input into this process, we employ elaborate dictionaries consisting of several 
hundred keywords that link particular substrings within the file path with known files subtypes (e.g., readme, 
manifest), vendors and assays. This approach underwent several rounds of development. Initially, specialized 
curators manually associated files with a particular assay following assay verification or registration (see 
Section 4). Once we had a sufficient number of assays and associated files, we manually generated keywords 
to reliably associate information, such as vendor and the assay, with each file. These keyword dictionaries 
were repeatedly updated, becoming more elaborate and precise. The manual approach for generating these 
keywords reflects the large differences between naming conventions across studies or vendors, as well as 
the lower number of files and assays curated during early stages which precluded a more systemic, machine 
learning-based approach. 
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To estimate the accuracy of the keyword dictionary approach across different technologies, we conducted 
a retrospective analysis comparing the automated assay assignment with post-curation assignment. The 
results (not shown) are highly technology-dependent. Some assays, such as the next-generation sequencing 
assays often achieve over 95% accuracy as these are usually run on highly automated modern platforms 
that generate a well-structured and consistent set of output files meant for scripted analysis. However, this 
approach faces some challenges when dealing with legacy data or methods for which information-rich 
standard output structure has not been established. The primary challenges we encounter are file names 
and paths that lack descriptive information or the available keywords in the file path are not assay-specific 
(as is the case with flow cytometry where the average accuracy is only 9%) 


Another issue is fine-tuning keyword precision (defined as the ratio of true positives to total predicted 
positives). From the standpoint of curation time and efforts, correcting false positive assignment is as time 
consuming as manual assignment. We therefore skewed our dictionary towards longer keywords for higher 
true positives over false positive ratio. For instance, as mentioned before, the common histologic assay 
Hematoxylin-Eosin stain can be abbreviated as “HE,” “H&E,” or take a similar form. Two-letter combinations 
are highly prone to false positive detection and are found in many other words, including drug and sample 
names. However, when we increase keyword length, we also increase false negative detection. Optimally, 
multiple longer keyword permutations should be included. For example, “HE” as it refers to the assay often 
appears as “_HE_” but can also appear in similar permutations such as “_HE,” “HE_,” and “-HE-,” all of 
which we include as keywords associated with the same assay. Taking histology files as example, the result 
of keyword optimization can be seen in the accuracy analysis (data not shown) showing about 50% true 
positive and about 50% no match with negligible number of false positives. As we accumulate more curated 
data that can be used to train classification models, we intend to abandon the keywords in favor of a 
machine learning based approach. 


5.1.3 Process 


Curation begins with a series of simple scripted steps. As all files pertaining to a particular study are 
already grouped into a compact folder structure, it is programmatically simple to generate a complete list 
of files and curator input is not required. Each file listed is tagged with previously known information (e.g., 
study name, the server housing the files). In the next step, file paths from each technology (e.g., ngs, 
histology) are scanned against a specialized set of keywords and regular expressions to derive file extensions, 
file subtypes, batch date, vendor and preliminary primary assay identification. This results in a table that 
lists all files in the study path as well as their programmatically derived metadata. 


Files related to each assay are tagged for “data level.” Data produced directly are machine output and 
recorded observation is “level 1.” Processed data, such as bioinformatics analysis pipeline output, is 
designated as “level 2.” Finally, organizational documents, such as sample manifests and sample flow plans, 
contain no data or derivatives but can be necessary for interpreting or labeling the data. These are labeled 
as “level 3.” Though we derive metadata for all, only files considered primary data, or “level 1,” are assigned 
assay metadata. This process is done programmatically based on file extension and subtype to add a level 
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tag to the various file categories, such as those created automatically by the operating system (such as 
thumbs.db) and documentation files (such as readme.pdf). 


This table is then reviewed and, if necessary, edited by a curator to ensure information such as batch 
date, vendor and primary assay identification were correctly assigned. Primary assay identification and 
vendor combinations are unique, and once both are correctly assigned, a script automatically adds the 
right site assay identification to the table. It is during this manual validation process that curators review 
documents containing assay metadata (see Section 4.2) and register any assays missing from the repository. 
From experience, we have learned that for some technologies, such as next-generation sequencing 
techniques, all files within a single folder are associated with the same assay. Thus, this validation of the 
derived metadata is done on a per-folder basis. This represents a significant reduction in curation efforts as 
the number of folders can be several orders of magnitudes lower than the number of files. Unfortunately, 
for technologies such as histology, folders often contain files associated with different assays and require a 
time consuming per-file validation process. The finalized validated table is then queued for QC. 


5.1.4 Quality Control 


The file metadata table undergoes both manual and scripted QC steps. During manual QC, a second 
curator goes over the table and the associated assay documentation. Automated QC involves over 30 
scripted checks. These include flagging missing values, testing for incorrect format, confirming that all files 
in the list exist on the server and that all files on the server are accounted for in the list. Most importantly, 
the script cross-references the derived data with vendor code list and with the assay registry. 


5.1.5 Output 


The final table contains metadata including assay identification, vendor, batch date, file extension and 
subtype, for all files containing primary data (Figure. 1). 


5.2 Linking Samples to Data Files 


To complete the linkage of metadata for subjects-samples-assays, these objects should be mapped to the 
data files associated with them. The rest of this section will describe the process of mapping samples to 
files associated with those samples, highlighting the differences between mapping high-dimensional and 
low-dimensional data. 


5.2.1 Inputs/Challenges 


After the data have been transferred and organized into the centralized fileshare location for each study, 
the file path serves as the main input that allows the curators to programmatically find all files containing 
sample identifiers within the study folder. In addition, the output table mapping subjects to samples (see 
Section 3.5) serves as a secondary input that allows identification of samples that were not linked to subjects 
for QC purposes. 
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While the process of extracting the sample identifiers from files is conceptually simple and can be mostly 
automated, major challenges in sample-to-file mapping include variability in types and formats of files 
containing sample identifiers, inconsistency in variable names containing the sample identifiers within these 
files, and differences and non-uniqueness in alphanumeric formats for the sample identifiers themselves. 


5.2.2 Different File Types and File Formats 


For low-dimensional data where the results for multiple samples are typically collected into data summary 
files, the most common file types are text and delimited text, Excel and SAS. These files typically contain 
tabular data in a wide variety of formats. Besides the preferable “tidy” data format [22] where each row 
belongs to a unique sample and sample identifiers can be found in just one of the variables, we have 
encountered multiple other file formats. Some examples include files with more than one variable per 
sample identifier, where the data table is preceded by a header of several rows, sample identifiers can be 
contained in columns or in rows, and Excel files that contain multiple worksheets (with sample identifiers 
contained in one or several of those worksheets). These variations make it challenging to fully automate 
sample identifier extraction from all data files, as a curator must intervene when a file with a new format 
appears that could not be processed by a script. 


High-dimensional data are often communicated by vendors as one file per sample per run. For files 
containing high-dimensional data, sample identification is usually embedded as part of the file name. For 
this type of file, successful extraction of the sample identifier depends on how easily a sample identifier 
can be distinguished from the other alphanumeric characters in the file name. 


5.2.3 Variability in Column Names Containing Sample Identifiers 


For the low-dimensional files, automation of sample identifier extraction depends on reliable identification 
of the columns (or rows) that contain the sample identifiers. In practice, however, sample identifier column 
names can vary. While the script can choose a column name from an array of all possible names encountered 
in previous studies, sometimes a new column name could result in a subset of files that cannot be processed 
by standardized script and must be processed separately. 


5.2.4 Non-unique Formats for Sample Identifiers 


Lack of format standardization for sample identifiers can result in non-unique simplistic formats that lead 
to difficulties with sample identifier extraction from the file names. For high-dimensional data where sample 
identifiers are part of the file name, scripts rely on regular expressions to distinguish and extract the sample 
identifier from the rest of the file name. If the sample identifier format is not sufficiently unique, regular 
expressions may not be able to distinguish between the sample identifier and the rest of the file name. 
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5.2.5 Duplicated Sample Identifiers 


Ideally, every sample identifier should describe a unique sample. However, there are instances when 
Subject-Sample mapping output contains duplicated sample identifiers. Since the sample identifiers 
associate files with samples during the sample-to-file mapping process, samples with non-unique identifiers 
are linked with their data files as well as the data files related to samples with shared identifiers. This 
generates a cascade of errors not only in sample-to-file mapping but also in sample-to-assay mapping. This 
sometimes results from poor sample management practices when the same sample identifier is used for 
parent and daughter samples by the data providers. 


5.3 Process 


The process of sample-to-file mapping consists of running standardized scripts to extract sample identifiers 
from file names in high-dimensional data or read in the files, extract the sample identifiers from low- 
dimensional data to produce tabular output with file names, file paths and sample identifiers. Separate 
standardized scripts were created for every technology (e.g., ngs, histology) to accommodate file type and 
format variability. Part of the design process for the standardized scripts was to incorporate feedback from 
every new study mapped to include new variable names and formats for the files for sample identifier 
extraction with fewer errors. 


Once the standardized scripts are run for every technology, the curator must check that sample identifiers 
were extracted correctly and that all the files containing sample identifiers have been processed by the 
scripts. If some data files could not be processed by the scripts (for example, due to an unexpected file 
format or a new name for a variable containing sample identifiers), the curators add the necessary code to 
complete the sample-to-file mapping process and correct any errors. 


When non-unique sample identifiers are present in the Subject-Sample output, it could be necessary to 
include additional variables to the sample-file table to uniquely identify the samples. 


5.4 Quality Control 


Quality control checks, both manual and scripted, are incorporated into the sample-to-file mapping 
process. After the standardized scripts produce the tabular output containing file paths, file names and 
sample identifiers extracted from these files (or sample identifiers extracted from the file names in case of 
high-dimensional data), the curator visually checks the extracted sample identifiers for errors. Following 
that step, the curator runs a QC script comparing the list of data files from the output to the total list of 
data files in the file repository and reporting the difference. The curator then goes through the list of 
unmapped files to see if these files contain sample identifiers. If unmapped files contain sample identifiers, 
the curator adds the necessary code to read in those files, extract the sample identifiers and add the data 
to the output. Next, the curator runs another QC script that compares the extracted sample identifiers to 
the sample identifiers from the Subject-Sample mapping process described in Section 3. This comparison 
identifies sample identifiers missed by the Subject-Sample mapping process and catches incorrectly 


646 Data Intelligence 


202211.00203v1 


chinaXiv 


ChinaXiva /ERAT 
Implementation of the FAIR Data Principles for Exploratory Biomarker Data from Clinical Trials 


extracted sample identifiers. Following this verification, only the rows with sample identifiers found in the 
Subject-Sample mapping process remain in the sample-to-file mapping output. Finally, another curator 
performs a QC check ensuring all steps were followed correctly. 


5.5 Output 


The curated output is written to a file in standard .csv format. This file contains all the file paths and 
names for the primary data and sample identifiers extracted from those files (Figure 1). Coupled with two 
additional tables, one which records sample metadata and one linking assay metadata to primary data files, 
these tables allow unique mapping of all samples to standardized assays and vice versa. 


6. PLANNING FOR FAIR DATA 


With our previous experience in mind, the goal of the future process is to include FAIR data requirements 
at the time of data generation planning. This has several advantages to the retrospective method. (1) time 
to prospectively plan is far less than time needed to retrospectively curate; (2) metadata requirements can 
be determined in advance for a given data type and applied consistently to different sources of the same 
data type; and (3) data format standardization allows for parsing and/or processing of the data through 
pipelines programmatically, further reducing the time and effort needed to deliver the data to scientists and 
analysts. 


In general, the planning phase should include the data management team, the biomarker scientist(s), and 
a metadata expert. These three components work together to assess the biomarker needs, select specific 
assays, define the laboratories that will generate the data, define the required metadata and communicate 
requirements to the laboratories that will generate the data. A file manifest will be required in which every 
file is associated with both the assay identifier and the sample identifier of the sample used to perform the 
assay and produce the file. The file manifest will provide the link between data file, sample, and assay. In 
addition, sample metadata, batch-level metadata, and quality control results will be required in the sample 
manifest. The two manifests together will provide all necessary metadata for FAIR representation of the data 
at time of data transfer and will allow for improved automation through data processing pipelines. 


Planning phase effort can be minimized by creating a library of metadata templates for the most common 
assay methods from which the data planning team can select. The library object required for each assay 
can be linked to the assay repository such that when an assay is identified from the repository, the required 
metadata template is provided. The templates include the required and expected metadata definitions and 
format as well as instructions for the data-generating laboratory to transfer the data. If the required assay is 
not yet registered to the assay repository, it can be defined and registered as part of the planning phase, 
and the appropriate metadata template (existing or new) associated with it. The metadata library can be the 
basis for a series of tools to support the data planning and receiving teams. For any given instantiation of 
a library object for a specific dataset, a tool can be created to automate the data conformance to the 
standard, call the appropriate parsing or pipeline routines and set access controls. 
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7. CONCLUSION 
7.1 FAIRification Effort 


There is a business culture aspect to collecting data and metadata: in a broad curation project such as 
described here, many people across the organization provide input (documents, manifests, data locations, 
standards, etc.) to the process. Directive, advocacy and expectation from the top of the organization is 
highly effective for embedding a data culture throughout the organization. In the efforts described here, 
collaboration between the curation team and data owners is imperative to the success and efficiency of the 
process. When all parties agree on the value of the effort, processes can be developed across lines to meet 
the needs of the individual teams while also supporting FAIRification efforts across the organization. 
Engagement with the standards owners is also critical to success. Biomarker methods evolve very quickly, 
and new methods are regularly introduced. Standards for these methods and the metadata associated with 
them often lag behind the incorporation of the methods by clinical teams, resulting in a need to develop 
internal standards. 


The majority of the work described in this paper addresses the findability and reusability of existing data 
collected over many years. The process is time-consuming, labor intensive, and aspects of metadata are 
often lost when not captured at the time the data are generated. This paper has described a semi-automated 
process to bring existing clinical trial biomarker data into FAIR recommendations. This process is particularly 
labor intensive due to the nature of how the existing data were collected in the absence of expectation for 
deep metadata. Several factors impact the effort required to curate a study, including the degree of change 
between the original version of standards and the version of standards to be applied, the number of 
technologies used to generate the biomarkers, and the total number of data sets to be curated. The curation 
team has provided a yearly average of approximately 10 Full Time Equivalent effort over 3 years thus far to 
this ongoing effort, resulting in over 700,000 samples mapped to more than 90,000 clinical trial participants 
and 4 million data files. Simultaneously, the team has curated 400 unique assays and assigned the resulting 
unique assay identifiers to more than 2 million files. While the output of the team grew faster than the size 
of the team (Figure 2) by improving processes and quality control review time with semi-automation, the 
effort reported here has curated approximately 25% of the total existing biomarker file repository. It is 
difficult to substantially improve the efficiency of this process when curating existing data collected under 
non-standard conditions, emphasizing the need for FAIR planning. The goal of the future planning process 
is to greatly improve this efficiency of creating FAIR data by collecting the metadata proactively and in a 
standard form. 
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Curation Effort and Output Over Time 
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Figure 2. Curation effort and output. The total number for each year is provided for the following effort and output 
characteristics: Full Time Equivalents (FTE)—defined as one person for 40 hours per week for 50 weeks; Subjects 
to Samples—the number of unique subjects that have been mapped to at least one curated sample; File to 
Samples—the number of unique files that have been mapped to the sample(s) for which data were reported in the 
file; and Files to Assays—the number of unique files that have been mapped to the curated assay(s) that produced 
the reported data. 


In view of the labor-intensive nature of the data FAIRification process described in this paper, it is 
reasonable to question if there are cost-saving alternatives to manual data curation and if there exists a fully 
automated data processing and curation solution, possibly leveraging the power of Al. Such fully automated 
solution has been considered and tried using one of the commercial data unification platforms that offered 
to leverage machine learning and other advanced algorithms to curate data at scale. Ultimately, only the 
semi-automated approach described in this paper achieved data FAIRification and integration at scale. Lack 
of consistency in data structure, content and format, data diversity (for each subset of data type, vendor, 
assay, etc.) and emergence of new types of data with time are among the contributing factors to the 
challenge of fully automating clinical and biomarker data curation. In our experience, there is no good 
alternative to the semi-automated process described above for the FAIRification of legacy data. As for the 
future studies, the manual component of the curation can be best minimized by addressing the inefficiencies 
in the business processes, rather than seeking advanced technical solutions. As described above in the 
Planning for FAIR Data section, planning and formalizing consistent data formats in advance of receiving 
data will significantly reduce costly manual curation efforts and speed up automation using scripted 
approaches developed in this work. 


One aspect of FAIR, interoperability, has not been fully implemented in the current approach. In the 
efforts described here, harmonization of metadata has been achieved. However, the FAIR data principles 
advocate that the harmonized controlled terms be mapped to Universal Resource Identifiers (URI) where 
the identifier is stable and can be mapped to various synonyms from source ontologies and terminologies. 
Practically, the URI is often a complex construct and is not amenable to human interpretation of the data 
so that the URIs need to be converted back to one of the terms that the user community can interpret. This 
concept is important to provide different views of the data to different consumers, where terms preferred 
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by one community may differ from another. The use and value of URIs has been well covered [12, 23, 24, 25] 
and the implementation of a terminology or ontology system to serve and translate URIs is out of scope of 
this current work. 


7.2 FAIR Data Value 


FAIRification of data greatly improves the ability to reuse data and by extension share it for further 
analysis either internally or externally. The processes described here bring metadata and data together in a 
way that only a single access point is needed to obtain data for analysis projects, whereas previously the 
information was scattered and disconnected. While this greatly improves overall accessibility to the data, 
individual access controls and sharing policies must be developed and maintained in concordance with 
local laws and study documents. Minimally, most clinical data sets will require access controls to conform 
to privacy and data sharing requirements associated with patient informed consent documents or other 
clinical study materials. Broader access can be appropriate with certain data anonymization practices, 
typically performed after the study is closed, cleaned and curated [26, 27, 28]. 


There are numerous options for sharing the data with the community. Choice of implementation should 
be based on user requirements. Some user communities will want to access processed data programmatically 
while others will require that data be presented as completed analyses. Importantly, the FAIRification 
process should prioritize community needs and several sharing environments can be implemented from 
the same data repository. 


From a business perspective, the outcome of this effort has been broad and varied. The Subject-Sample 
mapping was the original process implemented and enabled the early integration of RNA-Seq data sets 
with the clinical attributes for several internal data marts. While allowing for appropriate access controls, 
these data marts facilitate the ability of data scientists to readily find the data and re-use it to address 
scientific questions that were previously considered too time-consuming or even unapproachable. Outcomes 
of these analyses include identification of potential new targets for cancer immunotherapy, and ability to 
quickly address product questions from government agencies. Additional access options, such as database 
interfaces and visualization tools, have been implemented where the biomarker scientist community has 
independently investigated these data sets for simple relationships such as the behavior of a biomarker in 
different patient populations, performance of different biomarker assays, or correlation between biomarkers 
and genetic markers. Previously, these activities would have required effort from the biostatisticians or data 
scientists to find, merge, clean and analyze the data, presenting a huge barrier to such exploration. These 
questions are now readily addressed, advancing research initiatives quickly and efficiently. In addition, 
more formal internal analytical “challenges” have leveraged these clean data sets to crowd-source analytical 
efforts to further scientific understanding, propose new treatment hypotheses, and drive new treatment 
development. The improved analytic capability achieved from the data marts is a direct result of implementing 
the FAIR principles: improving findability of data previously disconnected from the clinical record while 
applying robust access controls appropriate to the use and users of the data, which in turn support the 
re-use of the studies for more powerful analyses. Over time, the processes evolved and matured to include 


650 Data Intelligence 


202211.00203v1 


chinaXiv 


ChinaXiva {EHRT 
Implementation of the FAIR Data Principles for Exploratory Biomarker Data from Clinical Trials 


the additional assay metadata and relationships described above. The additional depth of metadata is now 
available through a series of tools that allow scientists to explore existing data to determine if data already 
exist to meet their needs, saving precious time and resources on the path to anticipating and achieving 
patients’ needs. 
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